cpu/drcbearm64.cpp: Optimise load/store and call generation #13307

Merged
merged 2 commits into from
Feb 1, 2025


cuavas
Member

@cuavas cuavas commented Feb 1, 2025

This should implement some of the optimisations previously discussed for AArch64 code generation:

  • bl displacement is in words
  • The emit_*_mem functions know the operand size, so they can pass the corresponding shift to emit_ldr_str_base_mem rather than trying to calculate it after the fact
  • An immediate load/store offset can be either a 9-bit signed byte offset or an unsigned 12-bit element offset
  • An unsigned 12-bit element offset can always reach an entire page, so the page-relative access can always be done in two instructions
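
The encoding rules in the list above can be sketched roughly as follows. This is an illustration only, not the code from drcbearm64.cpp: `imm_form`, `classify_imm_offset`, `bl_reachable` and `encode_bl` are hypothetical names, and `size_log2` is assumed to be the log2 of the access size in bytes (0 for a byte, 3 for a doubleword).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not taken from drcbearm64.cpp.
enum class imm_form { scaled_u12, unscaled_s9, none };

// Pick an immediate addressing form for a byte offset from a base register.
// size_log2: 0 = byte, 1 = halfword, 2 = word, 3 = doubleword.
imm_form classify_imm_offset(int64_t offset, unsigned size_log2)
{
    assert(size_log2 <= 4);

    // Unsigned 12-bit element offset: the byte offset must be a non-negative
    // multiple of the element size, and the element count must fit in 12 bits.
    const int64_t mask = (int64_t(1) << size_log2) - 1;
    if (offset >= 0 && (offset & mask) == 0 && (offset >> size_log2) <= 0xfff)
        return imm_form::scaled_u12;

    // Signed 9-bit byte offset (the unscaled LDUR/STUR form): -256..255.
    if (offset >= -256 && offset <= 255)
        return imm_form::unscaled_s9;

    return imm_form::none; // needs an address calculation or a register offset
}

// BL holds a signed 26-bit displacement counted in 4-byte instruction words.
bool bl_reachable(uint64_t from, uint64_t to)
{
    const int64_t delta = int64_t(to - from);
    if (delta & 3)
        return false;
    const int64_t words = delta / 4;
    return words >= -(int64_t(1) << 25) && words < (int64_t(1) << 25);
}

// Encode BL, assuming bl_reachable(from, to) already holds.
uint32_t encode_bl(uint64_t from, uint64_t to)
{
    const int64_t words = int64_t(to - from) / 4;
    return 0x94000000u | (uint32_t(words) & 0x03ffffffu);
}
```

For a word access, `classify_imm_offset(16380, 2)` selects the scaled form (element offset 4095), while a byte offset of 16384 fits neither immediate form.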

And one bug fix:

  • AArch64 doesn’t allow an arbitrary left shift for a register offset; the shift can only be zero or log2 of the element size, so there’s no point testing intermediate shift values.
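
As a rough sketch of that constraint (the helper name `reg_offset_shift_encodable` is hypothetical and does not appear in the PR), the check reduces to two cases:

```cpp
#include <cassert>

// Hypothetical helper: can [Xn, Xm, LSL #shift] be encoded directly for an
// access of (1 << size_log2) bytes? The register-offset form has a single
// bit selecting between no shift and a shift by log2 of the access size.
bool reg_offset_shift_encodable(unsigned shift, unsigned size_log2)
{
    assert(size_log2 <= 4);
    return shift == 0 || shift == size_log2;
}
```

Any other shift amount would need a separate instruction to scale the index before the load or store.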

@987123879113 and/or @rb6502 can you check this out and test it?

@cuavas
Member Author

cuavas commented Feb 1, 2025

I just realised the 12-bit unsigned offset can only reach an entire page for aligned accesses. Hopefully the vast majority of accesses are aligned anyway.
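
A minimal sketch of why alignment matters here, assuming the two-instruction sequence is an adrp to form the 4 KB page base followed by a load/store with the low 12 bits as a scaled offset. `page_split` and `split_page_relative` are hypothetical names, and the sketch doesn't check that the page is within adrp range of the code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration; not the code from the PR.
struct page_split
{
    uint64_t page;      // 4 KB-aligned base that adrp would produce
    uint32_t lo12;      // byte offset of the target within that page
    bool     two_insn;  // true if lo12 fits the unsigned 12-bit element offset
};

page_split split_page_relative(uint64_t target, unsigned size_log2)
{
    assert(size_log2 <= 4);
    const uint32_t lo12 = uint32_t(target & 0xfff);

    // lo12 / size never exceeds 4095, so the only way the scaled form can
    // fail is if lo12 isn't a multiple of the access size.
    const bool aligned = (lo12 & ((1u << size_log2) - 1)) == 0;
    return page_split{ target & ~uint64_t(0xfff), lo12, aligned };
}
```

For example, a word access to 0x12345678 splits into page 0x12345000 and offset 0x678, which is encodable; the same access to 0x12345679 is not, although a byte access to that address is.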

There’s still one form of ldr we aren’t trying to use – the PC-relative form with a 19-bit signed displacement in words (±1MB reach). It can only be used to load a word or doubleword (not a byte or halfword, or a floating-point type), and there’s no equivalent str form.
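
For reference, a sketch of what checking and encoding the integer PC-relative form could look like. The helper names are hypothetical, nothing like this is in the PR, and the opcode constants are for the 32-bit and 64-bit general-purpose register variants only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers; the PR does not use this instruction form.

// The literal form holds a signed 19-bit displacement in 4-byte words,
// so the target must be word-aligned relative to the instruction and
// within -1 MB .. +1 MB - 4 of it.
bool ldr_literal_reachable(uint64_t pc, uint64_t target)
{
    const int64_t delta = int64_t(target - pc);
    return (delta & 3) == 0
            && delta >= -(int64_t(1) << 20)
            && delta < (int64_t(1) << 20);
}

// Encode LDR Wt/Xt, <label>, assuming ldr_literal_reachable(pc, target).
uint32_t encode_ldr_literal(uint64_t pc, uint64_t target, unsigned rt, bool is64)
{
    assert(rt < 32);
    const int64_t words = int64_t(target - pc) / 4;
    const uint32_t opcode = is64 ? 0x58000000u : 0x18000000u;
    return opcode | ((uint32_t(words) & 0x7ffffu) << 5) | rt;
}
```

With these definitions, loading x1 from eight bytes past the instruction encodes as 0x58000041.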

@987123879113
Contributor

Code looks fine to me. I ran it through a few games + the tester and no issues from what I can tell.

@rb6502
Contributor

rb6502 commented Feb 1, 2025

A few before/after -str 90 -nothrottle runs:

game (CPU)            before     after
shienryu (SH2)       414.39%   406.62%
s1945iii (SH2)      1319.27%  1296.48%
toyfight (SH4)       153.04%   168.68%
dc (SH4)             139.08%   142.06% (w/software list revilcv)
calspeed (MIPS)      332.66%   539.99%
kinst2 (MIPS)        729.66%   792.97%
gradius4 (PPC)       506.45%   501.04%
scud (PPC)           103.09%   103.20%

Very beneficial for MIPS, pretty much a wash otherwise.

@cuavas cuavas merged commit cdfb07c into mamedev:master Feb 1, 2025
5 checks passed
@cuavas cuavas deleted the a64offsets branch February 1, 2025 20:36
@cuavas
Member Author

cuavas commented Feb 1, 2025

Those MIPS results are kind of insane. Is it doing an inordinate number of fastram accesses or something?

@rb6502
Contributor

rb6502 commented Feb 1, 2025

Not sure what it was doing. Tried it again just now and the new code is still faster, but it's a much more normal delta.
